

Abstract 


In  this  paper  we  present  a  new  class  of  language  models.  This  class  derives  from  link  grammar, 
a  context-free  formalism  for  the  description  of  natural  language.  We  describe  an  algorithm  for 
determining  maximum-likelihood  estimates  of  the  parameters  of  these  models.  The  language  models 
which  we  present  differ  from  previous  models  based  on  stochastic  context-free  grammars  in  that 
they  are  highly  lexical.  In  particular,  they  include  the  familiar  n-gram  models  as  a  natural  subclass. 
The  motivation  for  considering  this  class  is  to  estimate  the  contribution  which  grammar  can  make 
to  reducing  the  relative  entropy  of  natural  language. 
Introduction


Finite-state  methods  occupy  a  special  position  in  the  realm  of  probabilistic  models  of  natural 
language.  In  particular,  the  simplicity,  and  simple-mindedness,  of  the  trigram  model  renders  it 
especially  well-suited  to  parameter  estimation  over  hundreds  of  millions  of  words  of  data,  resulting 
in  models  whose  predictive  powers  have  yet  to  be  seriously  contested.  It  has  only  been  through 
variations  on  the  finite-state  theme,  as  realized  in  cached  models,  for  example,  that  significant 
improvements  have  been  made.  This  state  of  affairs  belies  our  linguistic  intuition,  as  it  beguiles 
our  scientific  sensibilities. 

In the most common probabilistic model of context-free phrase structure grammar [8], the parameters
are the probabilities $\Pr(A \to B\,C)$ and $\Pr(A \to w)$, where $A$, $B$ and $C$ are nonterminals,
and $w$ is a terminal symbol. For natural language, experience has shown that this model only
weakly captures contextual dependencies, even if the set of nonterminals is sufficiently rich to encode
lexical information, a goal toward which many unification-based grammars strive [4]. More
to the point, the cross-entropies of language models constructed from probabilistic grammars have
so far been well above the cross-entropies of trigram language models [3, 6, 14].

Link  grammar  is  a  new  context-free  formalism  for  natural  language  proposed  in  [13].  What 
distinguishes this formalism from many other context-free models is the absence of explicit constituents,
as well as a high degree of lexicalization. It is this latter property which makes link
grammar  attractive  from  the  point-of-view  of  probabilistic  modeling. 

Of  course,  several  grammatical  formalisms  besides  link  grammar  have  been  proposed  which  are 
highly  lexical.  One  such  example  is  lexicalized  tree  adjoining  grammar  [12],  which  is  in  fact  weakly 
context  sensitive  in  generative  power.  While  this  formalism  is  promising  for  statistical  language 
modeling,  the  relative  inefficiency  of  the  training  algorithms  limits  the  scope  of  the  associated 
models.  In  contrast,  the  motivation  behind  constructing  a  probabilistic  model  for  link  grammar 
lies  in  the  fact  that  it  is  a  very  simple  formalism,  for  which  there  exists  an  efficient  parsing  algorithm. 
This  suggests  that  the  parameters  of  a  highly  lexical  model  for  link  grammar  might  be  estimated  on 
very  large  amounts  of  text,  giving  the  words  themselves  the  ability  to  fully  exercise  their  statistical 
rights  as  well  as  their  grammatical  proclivities.  In  this  way  one  can  hope  to  contest  the  unreasonable 
dominion  that  the  insipid  trigram  holds  over  probabilistic  models  of  natural  language. 


Link  grammar 


The  best  way  to  explain  the  basics  of  link  grammar  is  to  discuss  an  example  of  a  linkage.  Figure  1 
shows  how  a  linkage  is  formed  when  the  words,  thought  of  as  vertices,  are  connected  by  labelled 
arcs  so  that  the  resulting  graph  is  connected  and  planar,  with  all  arcs  written  above  the  words, 
and  not  more  than  one  arc  connecting  any  two  words.  The  labelled  arcs  are  referred  to  as  links. 


[Linkage diagram: labelled arcs drawn above the words of the sentence below.]

Stately, plump Buck Mulligan came from the stairhead,
bearing a bowl of lather on which a mirror and a razor lay crossed.

Figure 1


A usage of a word w is determined by the manner in which the word is linked to the right and
to the left in a sentence. In Figure 1, for example, the word “came” is seen to be preceded by a
subject, and followed by two adverbial phrases, separated by a comma. This usage of “came” is
characterized by an S connector on the left, and two right EV connectors, separated by a Comma
connector. We can thus say that one usage of the word “came” is ((S), (EV, Comma, EV)). Similarly, a
usage of the word “and” is ((N), (S, N)); that is, it may coordinate two noun phrases as the subject of
a verb. Of course, the labels in the above examples are quite simple; to incorporate more structure,
it would be natural for the connectors to be represented by feature structures, and for linking to
make use of unification.

A  dictionary  specifies  all  possible  usages  of  the  words  in  the  vocabulary.  A  usage  will  also  be 
referred  to  as  a  disjunct,  and  is  represented  by  a  pair  of  ordered  lists 

$$
d = ((l_m, l_{m-1}, \ldots, l_1),\ (r_1, r_2, \ldots, r_n)).
$$

The $l_i$'s are left connectors and the $r_i$'s are right connectors. Links are formed for a word $W$ with
disjunct $d$ by connecting each of the left connectors $l_i$ of $d$ to a right connector $r_i$ of some word $L_i$
to the left of $W$, and by similarly connecting each right connector $r_j$ of $d$ to the left connector
$l_j$ of some word $R_j$ to the right of $W$. In Figure 1, for example, the left N connector in the disjunct
((N), (S, N)) for the word “and” is connected to the right N connector in the ((A), (N)) disjunct for
“mirror.” The lists of left and right connectors are ordered, implying that the words to which
$l_1, l_2, \ldots$ are connected are decreasing in distance to the left of $W$, and the words to which the $r_j$ are
connected are decreasing in distance to the right of $W$. We will make use of the notation which for
a disjunct $d = ((l_m, l_{m-1}, \ldots, l_1), (r_1, r_2, \ldots, r_n))$ identifies $\triangleleft l_i = l_{i+1}$ in case $i < m$, and $\triangleleft l_m = \mathrm{NIL}$.
Similarly, we set $r_j\triangleright = r_{j+1}$ for $j < n$ and $r_n\triangleright = \mathrm{NIL}$. The first left connector of $d$ is denoted by
$\mathit{left}[d] = l_1$, and the first right connector is $\mathit{right}[d] = r_1$. Of course, it may be that $\mathit{left}[d] = \mathrm{NIL}$ or
$\mathit{right}[d] = \mathrm{NIL}$. In short, a disjunct can be viewed as consisting of two linked lists of connectors.

A  parse  or  linkage  of  a  sentence  is  determined  by  selecting  a  disjunct  for  each  word,  and  choosing 
a  collection  of  links  among  the  connectors  of  these  disjuncts  so  that:  the  graph  with  words  as 
vertices  and  links  as  edges  is  connected,  the  links  (when  drawn  above  the  words)  do  not  cross, 
each  connector  of  each  chosen  disjunct  is  the  end  point  of  exactly  one  link,  and  the  connectors 
at  opposite  ends  of  each  link  match.  If  no  such  linkage  exists  for  a  sequence  of  words,  then  that 
sequence  is  not  in  the  language  defined  by  the  link  grammar. 
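
The structural conditions on a linkage can be checked mechanically. The following is a minimal sketch, not the parser of [13]: it assumes a candidate linkage is supplied as a list of `(i, j, a, b)` tuples, where `i < j` are word positions, `a` is the right connector used at word `i` and `b` is the left connector used at word `j`. The names `is_valid_linkage` and `match` are hypothetical. The sketch tests matching, planarity and connectivity; verifying that every connector of each chosen disjunct is used exactly once would additionally require the disjuncts themselves.

```python
from itertools import combinations


def is_valid_linkage(n_words, links, match=lambda a, b: a == b):
    """Check matching, planarity and connectivity of a candidate linkage.

    links: list of (i, j, right_conn_of_i, left_conn_of_j) with i < j.
    """
    # Connectors at opposite ends of each link must match.
    if any(not (i < j and match(a, b)) for i, j, a, b in links):
        return False
    # At most one link may connect any two words.
    pairs = [(i, j) for i, j, _, _ in links]
    if len(set(pairs)) != len(pairs):
        return False
    # Links drawn above the words must not cross.
    for (i, j), (k, m) in combinations(pairs, 2):
        if i < k < j < m or k < i < m < j:
            return False
    # The graph with words as vertices must be connected (union-find).
    parent = list(range(n_words))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    return len({find(x) for x in range(n_words)}) == 1
```

For a three-word sentence, `is_valid_linkage(3, [(0, 1, "D", "D"), (1, 2, "S", "S")])` accepts the chain of two links, while a pair of links such as `(0, 2, ...)` and `(1, 3, ...)` is rejected because the arcs cross.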

We refer the reader to [13] for more information about link grammars. That report describes a
terse notation for use in writing link grammars, the workings of a wide-coverage link grammar for
English, and efficient algorithms and heuristics for parsing sentences in a link grammar.

Link  grammars  resemble  two  other  context-free  grammatical  formalisms:  categorial  grammars  [11] 
and  dependency  grammars  [7,  10].  Both  link  grammar  and  categorial  grammar  are  highly  lexical. 
The cancellation operator in a categorial grammar derivation is similar to the linking process in a link
grammar. In fact, it is possible to take a categorial grammar and generate an equivalent link grammar.
(The reverse seems to be much more difficult.) Dependency grammars, like link grammars,
involve  drawing  links  between  the  words  of  a  sentence.  However,  they  are  not  lexical,  and  (as 
far  as  we  know)  lack  a  parsing  algorithm  of  efficiency  comparable  to  that  of  link  grammars.  Our 
approach  to  probabilistic  modeling  of  grammar  depends  on  the  existence  of  an  efficient  parsing 
algorithm,  and  on  having  enough  flexibility  to  represent  the  bigram  and  trigram  models  within  the 
same  framework. 


The  Recognition  Algorithm 


An algorithm for parsing with link grammar is presented in [13]. The algorithm proceeds by
constructing links in a top-down fashion. The recursive step is to count all linkages between a left
word $L$ and a right word $R$ which make use of the (right) connector $l$ for $L$ and the (left) connector
$r$ for $R$, assuming that $l$ and $r$ are connected via links that have already been made. The algorithm
proceeds by checking, for each disjunct $d$ of each word $L < W < R$, whether a connection can be
made between $d$ and $l$ or $r$. There are three possibilities. It may be the case that $\mathit{left}[d]$ links to $l$
and $\mathit{right}[d]$ is either NIL or remains unconnected. Or, $\mathit{right}[d]$ may link to $r$ and $\mathit{left}[d]$ is NIL or
unconnected. Alternatively, it may be that $\mathit{left}[d]$ is connected to $l$ and $\mathit{right}[d]$ is connected to $r$.

As a matter of notation, we'll refer to the words in a sentence $S = W_0 W_1 \cdots W_{N-1}$ by using the
indices $0, 1, \ldots, N-1$. Also, we'll introduce the boundary word $W_N$ for convenience, assigning to
it the single disjunct ((NIL), (NIL)). Each word $0 \le W < N$ has an associated set $\mathcal{D}(W)$ of possible
disjuncts. Let $c(L, R, l, r)$ be the number of ways of constructing a sublinkage between $L$ and $R$
using $l$ and $r$, as described in [13]. Then $c(L, L+1, l, r)$ is equal to one in case $l = r = \mathrm{NIL}$ and is
equal to zero otherwise.

The  following  is  a  recursive  expression  for  c(L,R,l,r),  on  which  the  dynamic  programming 
algorithm  of  [13]  is  based: 

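$$
\begin{aligned}
c(L, R, l, r) = \sum_{L<W<R}\ \sum_{d \in \mathcal{D}(W)} \Big[\ & \mathit{match}(l, \mathit{left}[d])\ c(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ c(W, R, \mathit{right}[d], r) \\
{}+{} & \mathit{match}(l, \mathit{left}[d])\ \mathit{match}(\mathit{right}[d], r)\ c(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ c(W, R, \mathit{right}[d]\triangleright, \triangleleft r) \\
{}+{} & \delta_{\mathrm{NIL}}(l)\ \mathit{match}(\mathit{right}[d], r)\ c(L, W, l, \mathit{left}[d])\ c(W, R, \mathit{right}[d]\triangleright, \triangleleft r)\ \Big]
\end{aligned}
$$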

Here $\delta$ is the standard delta function and $\mathit{match}$ is an indicator function, taking values 0 and 1,
which determines whether two connectors may be joined to form a link. The function $\mathit{match}$ must
only satisfy $\mathit{match}(c, \mathrm{NIL}) = \mathit{match}(\mathrm{NIL}, c) = 0$ for any connector $c$, but is otherwise completely
general, and could, for example, take into account unification of feature structures. The term
$\delta_{\mathrm{NIL}}(l)$ is included to prevent overcounting of linkages. Since there are at most $\binom{N}{3}$ triples $(L, W, R)$
to be tried, the complexity of the parsing algorithm is $O(D^3 \cdot N^3)$, where $D$ is an upper bound on
the number of disjuncts of an arbitrary word in the grammar. The total number of linkages, or
parses, of the sentence $S = W_0 \cdots W_{N-1}$ is $\sum_{d \in \mathcal{D}(W_0)} c(0, N, \mathit{right}[d], \mathrm{NIL})$.
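
The counting recursion can be sketched in a few lines of memoized Python. This is an illustration under stated assumptions rather than the implementation of [13]: connector lists are represented as tuples ordered farthest link first, the empty tuple stands for NIL, connectors match when their labels are equal, the guard `not l` plays the role of the term that prevents overcounting, and only disjuncts of the first word with no left connectors are used as starting points. The name `count_linkages` and the toy dictionary are hypothetical.

```python
from functools import lru_cache


def count_linkages(disjuncts, match=lambda a, b: a == b):
    """Count the linkages of a sentence with the recursion for c(L, R, l, r).

    disjuncts[W] is a list of (left, right) pairs of connector tuples for
    word W; () plays the role of NIL.
    """
    n = len(disjuncts)  # position n is the boundary word ((NIL), (NIL))

    @lru_cache(maxsize=None)
    def c(L, R, l, r):
        if R == L + 1:  # adjacent words: nothing may remain unlinked
            return 1 if not l and not r else 0
        total = 0
        for W in range(L + 1, R):
            for left, right in disjuncts[W]:
                # Ways to finish (L, W) once left[d] has been linked to l.
                lc = c(L, W, l[1:], left[1:]) if l and left and match(l[0], left[0]) else 0
                # Ways to finish (W, R) once right[d] has been linked to r.
                rc = c(W, R, right[1:], r[1:]) if right and r and match(right[0], r[0]) else 0
                if lc:
                    total += lc * c(W, R, right, r)  # d links to l only
                total += lc * rc                     # d links to both l and r
                if rc and not l:                     # d links to r only (l must be NIL)
                    total += rc * c(L, W, l, left)
        return total

    return sum(c(0, n, right, ()) for left, right in disjuncts[0] if not left)


# Toy dictionary for "the cat ran": D links determiner to noun, S subject to verb.
sentence = [[((), ("D",))], [(("D",), ("S",))], [(("S",), ())]]
print(count_linkages(sentence))  # 1
```

Memoizing on the four arguments is what keeps the computation polynomial: each distinct `(L, R, l, r)` is expanded once, and every recursive call is on a strictly shorter span.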


The  Probabilistic  Model 


It is natural to develop a generative probabilistic model of link grammar. In using the term generative
we imply that the model will assign total probability mass one to the language of the grammar.
The usual probabilistic model of context-free phrase structure grammar, given by the parameters
$\Pr(A \to B\,C)$ and $\Pr(A \to w)$, also has this property.

Just  as  the  basic  operation  of  context-free  phrase  structure  grammar  is  rewriting,  the  basic 
operation of link grammar is linking. A link depends on two connectors, a left connector $l$ and
a right connector $r$. These are the analogues of a nonterminal $A$ which is to be rewritten for a
phrase structure grammar. Given $l$ and $r$, a link is formed by first choosing a word $W$ to link to,
followed by a choice of disjunct $d$ for the word. Finally, an orientation is chosen for the link by
deciding whether $d$ links to $l$, to $r$, or to both $l$ and $r$. In fact, we may also take into account the
identities of the words $L$ and $R$ to which the connectors $l$ and $r$ are associated. This suggests the
set of parameters

$$
\Pr(W, d, O \mid L, R, l, r)
$$

for a probabilistic model. Here $O$ is a random variable representing the orientation of the link,
which we will allow to have values $\leftarrow$, $\rightarrow$, or $\leftrightarrow$, in case $d$ is linked to $l$, to $r$, or to both $l$ and $r$.

Of  course,  this  probability  may  be  decomposed  as 

$$
\Pr(W, d, O \mid L, R, l, r) = \Pr(W \mid L, R, l, r)\ \Pr(d \mid W, L, R, l, r)\ \Pr(O \mid d, W, L, R, l, r).
$$

Since  we  are  forming  conditional  probabilities  on  a  set  of  events  which  is  potentially  quite  large 
for  a  reasonable  grammar  and  vocabulary  for  natural  language,  it  may  be  impossible  in  practice  to 
form  reliable  estimates  for  them.  We  thus  approximate  these  probabilities  as 

$$
\Pr(W, d, O \mid L, R, l, r) \approx \Pr(W \mid L, R, l, r)\ \Pr(d \mid W, l, r)\ \Pr(O \mid d, l, r).
$$

In addition, we require the joint probability $\Pr(W_0, d_0)$ of an initial word and disjunct.

The probability of a linkage is the product of all its link probabilities. That is, we can express
a linkage $\mathcal{L}$ as a set of links $\mathcal{L} = \{(W, d, O, L, R, l, r)\}$ together with an initial disjunct $d_0$, and we
assign to $\mathcal{L}$ probability

$$
\Pr(S, \mathcal{L}) = \Pr(W_0, d_0) \prod \Pr(W, d, O \mid L, R, l, r)
$$

where the product is taken over all links in $\mathcal{L}$, and where we have noted the dependence on
the sentence $S$ being generated. This probability is thus to be thought of as the probability of
generating $S$ with the linkage $\mathcal{L}$. The cross-entropy of a corpus $S_1, S_2, \ldots$ with respect to the
uniform distribution on individual sentences is then given by

$$
H = -\gamma^{-1} \sum_{i} \log \sum_{\mathcal{L}} \Pr(S_i, \mathcal{L})
$$

for some normalizing term $\gamma$. In the following, we will describe an algorithm to determine a set of
parameters which locally minimize this entropy.




Finite-state approximations


Link  grammars  may  be  constructed  in  such  a  way  that  the  corresponding  probabilistic  model  is 
a  finite-state  Markov  chain  corresponding  to  the  n-gram  model.  For  example,  the  link  grammar 
whose  corresponding  probabilistic  model  is  equivalent  to  the  bigram  model  is  depicted  in  Figure  2. 


///  a  b  c  d  e  f  g 
Figure  2:  A  bigram  model 

Suppose,  as  another  example,  that  the  grammar  is  made  up  of  a  dictionary  where  a  word  w  has 
the  set  of  disjuncts 


((*), (w))

((xy), (yw))

((xy), (NIL))

where x and y represent arbitrary words in the vocabulary. The disjunct ((xy), (yw)) represents
the assumption that any two words x and y may precede a word w. This information is passed
through the left connector. The identity of the previous word y and the current word w is then
passed through the right connector. The disjunct ((*), (w)) represents the modeling assumption
that any word can begin a sentence. Finally, the disjunct ((xy), (NIL)) allows any word to be the
last word in a sentence. An artificial word ///, called “the wall,” is introduced to represent the
sentence boundary [13], and is given the single disjunct ((NIL), (*)). Given this set of disjuncts,
each sentence has a unique linkage, which is represented in Figure 3. The resulting probabilistic
model is precisely the familiar trigram model.

/// a b c d e f g
Figure  3:  A  trigram  model 


Of  course,  since  the  generative  power  of  link  grammar  is  context-free,  any  finite  state  model 
can  be  represented.  The  point  to  be  made  with  the  above  example,  however,  is  that  because  of  the 
lexical  nature  of  the  probabilistic  model  that  is  being  proposed,  finite-state  language  models  such 
as  the  n-gram  model  and  its  derivatives  can  be  easily  and  naturally  represented  in  a  probabilistic 
model  of  link  grammar.  Probabilistic  link  grammar  thus  provides  a  uniform  framework  for  finite- 
state  as  well  as  linguistically  motivated  models  of  natural  language. 

In  order  to  capture  the  trigram  model  in  a  traditional  probabilistic  context-free  grammar,  the 
following  grammar  could  be  used,  where  Axy  is  a  nonterminal  parameterized  by  the  “previous” 
words  x  and  y. 

However, it would certainly be awkward, at best, to incorporate the above productions into a
natural language grammar. The essence of the problem, of course, is that the Greibach normal
form of a natural language grammar rarely provides a strong equivalence, but rather distorts the
trees in a linguistically senseless fashion.


riverrun, past Eve and Adam's, from swerve of shore to bend of bay,
brings us by a commodius vicus of recirculation back to Howth Castle and Environs.

Figure  4:  A  bigram/grammar  model 

On the other hand, the corresponding finite-state links could be easily included in a link grammar
for natural language in a manner which preserves the relevant structure. While the formalisms
are equivalent from the point of view of generative power, the absence of explicit constituents as
well as the head-driven nature of link grammar make it well suited to probabilistic modeling. As an
example, in the linkage displayed in Figure 4, subject-verb agreement, object-verb attachment,
and adverbial clause attachment are handled using grammar, while the remaining words within
each clause or phrase are related by the bigram model. In addition, the logical relation between the
words “from” and “to” is represented in a link. In this manner long-distance dependencies can be
seamlessly incorporated into a bigram or trigram model.

The  Training  Algorithm 

We  have  developed  and  implemented  an  algorithm  for  determining  maximum-likelihood  estimates 
of the parameters of probabilistic link grammar. The algorithm is in the spirit of the Inside-Outside
algorithm [8], which, in turn, is a special case of the EM algorithm [2]. The algorithm computes
two types of probabilities, which we refer to as inside probabilities $\mathrm{Pr}_I$ and outside probabilities
$\mathrm{Pr}_O$. Intuitively, the inside probability $\mathrm{Pr}_I(L, R, l, r)$ is the probability that the words between $L$
and $R$ can be linked together so that the linking requirements of connectors $l$ and $r$ are satisfied.
The term $\mathrm{Pr}_O(L, R, l, r)$ is the probability that the words outside of the words $L$ and $R$ are linked
together so that the linking requirements outside of the connectors $l$ and $r$ are satisfied. Given
these probabilities, the probability that the sentence $W_0, \ldots, W_{N-1}$ is generated by the grammar
is equal to

$$
\Pr(S) = \sum_{d_0 \in \mathcal{D}(W_0)} \Pr(W_0, d_0)\ \mathrm{Pr}_I(0, N, \mathit{right}[d_0], \mathrm{NIL}).
$$

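The inside probabilities are computed recursively through the relations

$$
\begin{aligned}
\mathrm{Pr}_I(L, R, l, r) = \sum_{L<W<R}\ \sum_{d \in \mathcal{D}(W)} \Big[\ & \Pr(W, d, \leftarrow \mid L, R, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d], r) \\
{}+{} & \Pr(W, d, \leftrightarrow \mid L, R, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r) \\
{}+{} & \Pr(W, d, \rightarrow \mid L, R, l, r)\ \mathrm{Pr}_I(L, W, l, \mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r)\ \Big].
\end{aligned}
$$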
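
The inside pass has the same shape as the linkage-counting recursion, with each structural indicator weighted by a link probability. The sketch below is illustrative only: `link_prob` and `init_prob` are hypothetical callables standing in for $\Pr(W, d, O \mid L, R, l, r)$ and $\Pr(W_0, d_0)$, word positions stand in for word identities, connectors match on label equality, and the rightward-only term is applied only when `l` is NIL. That last guard mirrors the overcounting rule of the counting recursion and is an assumption here, since the relation above leaves it implicit.

```python
from functools import lru_cache


def inside_probability(disjuncts, link_prob, init_prob, match=lambda a, b: a == b):
    """Pr(S) computed with the inside recursion.

    disjuncts[W]: list of (left, right) connector tuples; () stands for NIL.
    link_prob(W, d, o, L, R, l, r): hypothetical Pr(W, d, O | L, R, l, r),
        with o one of "<-", "->", "<->".
    init_prob(d): hypothetical Pr(W_0, d_0) for a disjunct d of word 0.
    """
    n = len(disjuncts)

    @lru_cache(maxsize=None)
    def inside(L, R, l, r):
        if R == L + 1:
            return 1.0 if not l and not r else 0.0
        total = 0.0
        for W in range(L + 1, R):
            for d in disjuncts[W]:
                left, right = d
                to_l = bool(l and left and match(l[0], left[0]))
                to_r = bool(right and r and match(right[0], r[0]))
                if to_l:   # d links to l only
                    total += (link_prob(W, d, "<-", L, R, l, r)
                              * inside(L, W, l[1:], left[1:]) * inside(W, R, right, r))
                if to_l and to_r:   # d links to both l and r
                    total += (link_prob(W, d, "<->", L, R, l, r)
                              * inside(L, W, l[1:], left[1:]) * inside(W, R, right[1:], r[1:]))
                if to_r and not l:  # d links to r only; guard is an assumption
                    total += (link_prob(W, d, "->", L, R, l, r)
                              * inside(L, W, l, left) * inside(W, R, right[1:], r[1:]))
        return total

    return sum(init_prob(d) * inside(0, n, d[1], ()) for d in disjuncts[0] if not d[0])
```

Setting every link probability and the initial probability to one reduces the result to the number of linkages, which gives a simple sanity check against the counting recursion.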
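
The outside probabilities are computed by first setting

$$
\mathrm{Pr}_O(0, N, \mathit{right}[d], \mathrm{NIL}) = \Pr(W_0, d)
$$

for each disjunct $d \in \mathcal{D}(W_0)$ with $\mathit{left}[d] = \mathrm{NIL}$. The remaining outside probabilities are then
obtained as a sum of four terms,

$$
\mathrm{Pr}_O(L, R, l, r) = \mathrm{Pr}_O^{\mathrm{left}\triangleleft}(L, R, l, r) + \mathrm{Pr}_O^{\mathrm{left}}(L, R, l, r) + \mathrm{Pr}_O^{\mathrm{right}\triangleright}(L, R, l, r) + \mathrm{Pr}_O^{\mathrm{right}}(L, R, l, r)
$$

where these probabilities are computed recursively through the following relations:

$$
\begin{aligned}
\mathrm{Pr}_O^{\mathrm{left}\triangleleft}(L, W, l\triangleright, \triangleleft\mathit{left}[d]) = \sum_{R>W} \sum_{r} \mathrm{Pr}_O(L, R, l, r)\ \big[\ & \Pr(W, d, \leftarrow \mid L, R, l, r)\ \mathrm{Pr}_I(W, R, \mathit{right}[d], r) \\
{}+{} & \Pr(W, d, \leftrightarrow \mid L, R, l, r)\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r)\ \big]
\end{aligned}
$$

$$
\begin{aligned}
\mathrm{Pr}_O^{\mathrm{right}\triangleright}(W, R, \mathit{right}[d]\triangleright, \triangleleft r) = \sum_{L<W} \sum_{l} \mathrm{Pr}_O(L, R, l, r)\ \big[\ & \Pr(W, d, \rightarrow \mid L, R, l, r)\ \mathrm{Pr}_I(L, W, l, \mathit{left}[d]) \\
{}+{} & \Pr(W, d, \leftrightarrow \mid L, R, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ \big]
\end{aligned}
$$

$$
\mathrm{Pr}_O^{\mathrm{left}}(L, W, l, \mathit{left}[d]) = \sum_{R>W} \sum_{r} \mathrm{Pr}_O(L, R, l, r)\ \Pr(W, d, \rightarrow \mid L, R, l, r)\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r)
$$

$$
\mathrm{Pr}_O^{\mathrm{right}}(W, R, \mathit{right}[d], r) = \sum_{L<W} \sum_{l} \mathrm{Pr}_O(L, R, l, r)\ \Pr(W, d, \leftarrow \mid L, R, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d]).
$$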

The  expected  number  of  times  that,  for  example,  a  word  W  is  linked  to  words  L  and  R  through 
connectors  l  and  r  in  a  given  sentence  S  is  then  determined  by 

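$$
\begin{aligned}
\mathrm{Count}(W, L, R, l, r) = \mathrm{Pr}_O(L, R, l, r)\ \Pr(S)^{-1} \sum_{d \in \mathcal{D}(W)} & \Pr(W \mid L, R, l, r)\ \Pr(d \mid W, l, r)\ \Big\{ \\
& \Pr(\leftarrow \mid d, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d], r) \\
{}+{} & \Pr(\rightarrow \mid d, l, r)\ \mathrm{Pr}_I(L, W, l, \mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r) \\
{}+{} & \Pr(\leftrightarrow \mid d, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r)\ \Big\}.
\end{aligned}
$$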

The  counts  for  the  parameters  Pr  (  d  |  W,l,r)  and  Pr  (  0  |  d,  /,  r )  are  obtained  in  a  similar  way. 
For  completeness,  we  list  the  expected  counts  below. 

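$$
\begin{aligned}
\mathrm{Count}(d, W, l, r) = \Pr(S)^{-1} \sum_{L, R} \mathrm{Pr}_O(L, R, l, r)\ & \Pr(W \mid L, R, l, r)\ \Pr(d \mid W, l, r)\ \Big\{ \\
& \Pr(\leftarrow \mid d, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d], r) \\
{}+{} & \Pr(\rightarrow \mid d, l, r)\ \mathrm{Pr}_I(L, W, l, \mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r) \\
{}+{} & \Pr(\leftrightarrow \mid d, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r)\ \Big\}.
\end{aligned}
$$

$$
\begin{aligned}
\mathrm{Count}(\leftarrow, d, l, r) = \Pr(S)^{-1} \sum_{L, W, R} & \mathrm{Pr}_O(L, R, l, r)\ \Pr(W \mid L, R, l, r)\ \Pr(d \mid W, l, r) \\
{}\times{} & \Pr(\leftarrow \mid d, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d], r)
\end{aligned}
$$

$$
\begin{aligned}
\mathrm{Count}(\rightarrow, d, l, r) = \Pr(S)^{-1} \sum_{L, W, R} & \mathrm{Pr}_O(L, R, l, r)\ \Pr(W \mid L, R, l, r)\ \Pr(d \mid W, l, r) \\
{}\times{} & \Pr(\rightarrow \mid d, l, r)\ \mathrm{Pr}_I(L, W, l, \mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r)
\end{aligned}
$$

$$
\begin{aligned}
\mathrm{Count}(\leftrightarrow, d, l, r) = \Pr(S)^{-1} \sum_{L, W, R} & \mathrm{Pr}_O(L, R, l, r)\ \Pr(W \mid L, R, l, r)\ \Pr(d \mid W, l, r) \\
{}\times{} & \Pr(\leftrightarrow \mid d, l, r)\ \mathrm{Pr}_I(L, W, l\triangleright, \triangleleft\mathit{left}[d])\ \mathrm{Pr}_I(W, R, \mathit{right}[d]\triangleright, \triangleleft r)
\end{aligned}
$$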


The  algorithm  for  obtaining  these  counts  is  derived  from  the  dynamic  programming  algorithm 
given  in  [13].  The  algorithm  involves  three  passes  through  the  sentence.  The  first  pass  computes 
the  inside  probabilities  in  much  the  same  way  that  the  basic  recognition  algorithm  computes  the 
number  of  linkages.  A  second  pass  computes  the  outside  probabilities.  Finally,  a  third  pass  updates 
the  counts  for  the  parameters  of  the  model  in  a  manner  suggested  by  the  above  equations. 

While  the  algorithm  that  we  have  outlined  is  in  the  spirit  of  the  inside-outside  algorithm,  the 
actual  computations  in  the  two  algorithms  are  quite  different.  First,  the  inside  pass  proceeds 
in  a  top-down  manner  for  link  grammar,  while  the  usual  inside-outside  algorithm  is  based  upon 
the  bottom-up  CKY  chart  parsing  algorithm.  On  the  other  hand,  while  the  outside  pass  for  link 
grammar  is  top-down,  it  differs  from  the  outside  pass  for  the  inside-outside  algorithm  in  that  the 
computation  is  structured  exactly  like  the  inside  pass.  Thus,  there  is  a  symmetry  that  does  not 
exist in the usual algorithm. In addition, there is an efficient check on the correctness of the
computation. This lies in the fact that for each word $W$ in a given sentence $S$, the total count
$\sum \mathrm{Count}(W, L, R, l, r)$ must be equal to one, where the sum is taken over all $L$, $R$, $l$, and $r$
which occur in a linkage of $S$.


Smoothing 


Obtaining reliable estimates of the parameters of probabilistic language models is always a fundamental
issue. In the case of the models proposed above, this is especially a concern due to
the  large  number  of  parameters.  Several  methods  of  “smoothing”  the  estimates  naturally  suggest 
themselves.  One  such  approach  is  to  form  the  smoothed  estimates 

$$
\tilde{\Pr}(W \mid L, R, l, r) = \gamma\, \delta_{l,r}(W) \left[ \lambda \Pr(W \mid L, R) + (1 - \lambda) \Pr(W \mid L, R, l, r) \right]
$$

where $\delta_{l,r}(W)$ is equal to one in case the word $W$ has a disjunct that can link to either $l$ or $r$, and
zero otherwise, and $\gamma$ is a normalizing constant. This method of smoothing is attractive since the
probabilities $\Pr(W \mid L, R)$ can be obtained from unparsed text. In fact, since for a given sentence
$S$ of length $N$ there are $\binom{N}{3}$ ways of choosing words that may potentially participate together in a linking, if
we assume that the sentences in a corpus have lengths which are Poisson-distributed with a mean
of 25, then there is an average of 2604 word triples per sentence, or approximately 100 times the
number of usual trigrams. We can view the probability $\Pr(W \mid L, R)$ as the prior probability
that the triple $(L, W, R)$ forms a grammatical trigram.
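
The figure of 2604 can be checked directly. Assuming, as stated, that the sentence length $N$ is Poisson-distributed with mean $\lambda = 25$, the third factorial moment of $N$ is $\lambda^3$, so that

$$
\mathbb{E}\binom{N}{3} = \frac{\mathbb{E}[N(N-1)(N-2)]}{6} = \frac{\lambda^3}{6} = \frac{15625}{6} \approx 2604.2 .
$$

A sentence of length $N$ contributes on the order of $N$ ordinary trigrams, about 25 on average, and $2604 / 25 \approx 104$, which agrees with the factor of approximately 100.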

Having  obtained  the  maximum-likelihood  estimates  of  the  parameters  of  our  model,  we  may 
then  obtain  the  posterior  probabilities  of  grammatical  trigrams  as 

$$
\hat{\Pr}(W \mid L, R) = \sum_{l, r} \hat{\Pr}(W \mid L, R, l, r)\ \Pr(l, r \mid L, R)
$$

Here the probabilities $\Pr(l, r \mid L, R)$ are obtained through the joint probabilities $\Pr(L, R, l, r)$
which are estimated through the expected counts

$$
\mathrm{Count}(L, R, l, r) = \mathrm{Pr}_O(L, R, l, r)\ \mathrm{Pr}_I(L, R, l, r).
$$

Further refinements to the smoothed distributions can be made using standard methods of deleted
interpolation [1].
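
As an illustration of the interpolated estimate introduced at the start of this section, the following minimal sketch smooths the word distribution for one fixed context $(L, R, l, r)$. It assumes the two component distributions are supplied as dictionaries and that the set of words with a disjunct able to link to $l$ or $r$ is known; the function name `smoothed_word_probs` and its argument names are hypothetical, and the interpolation weight is simply passed in rather than estimated.

```python
def smoothed_word_probs(p_lex, p_full, can_link, lam):
    """Interpolate Pr(W | L, R) with Pr(W | L, R, l, r) for one context.

    p_lex[W]  : Pr(W | L, R), e.g. estimated from unparsed text
    p_full[W] : Pr(W | L, R, l, r), the maximum-likelihood estimate
    can_link  : words having a disjunct that can link to l or r (the indicator)
    lam       : interpolation weight lambda in [0, 1]
    """
    raw = {
        w: lam * p_lex.get(w, 0.0) + (1.0 - lam) * p_full.get(w, 0.0)
        for w in can_link
    }
    z = sum(raw.values())
    if z == 0.0:
        return {}
    # Dividing by z applies the normalizing constant over the allowed words.
    return {w: p / z for w, p in raw.items()}
```

Words outside `can_link` receive probability zero, so mass that the lexical distribution assigns to words that cannot link is redistributed by the renormalization.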


Prospects 


The  above  class  of  models  can  be  extended  in  many  different  directions.  For  example,  decision 
trees can be used to estimate the probabilities, as we have done in various other problems [4, 5].
Increasing  the  complexity  of  the  models  in  this  manner  can  promote  the  generative  power  to  the 
class  of  context-sensitive  languages.  From  a  less  formal  point  of  view,  such  an  extension  would 
allow  the  statistics  to  better  capture  the  long-range  dependencies  which  are  inherent  in  any  large 
corpus.  But  the  essence  of  the  class  of  probabilistic  models  that  has  been  proposed  is  that  the 
parameters  are  highly  lexical,  though  simple.  In  proceeding  to  actually  carry  out  a  program  for 
constructing  such  models,  one  can  at  least  begin  to  reach  for  the  gauntlet  [6]  that  has  been  thrown 
down  in  the  name  of  the  maligned  trigram. 